Discovering the Biomedical Deep Web
Abstract
The rapid growth of biomedical information in the Deep Web has produced unprecedented challenges for traditional search engines. This paper describes a new Deep Web resource discovery system for biomedical information. We designed two hypertext mining applications: a Focused Crawler that selectively seeks out relevant pages using a classifier that evaluates the relevance of a document with respect to biomedical information, and a Query Interface Extractor that extracts information from each page to detect the presence of a Deep Web database. Our experiments suggest that combining focused crawling with query interface extraction is effective for building high-quality collections of Deep Web resources on biomedical topics.

1 Project and System Overview

This research tackles two issues. First, much of the information on the web has no explicit URL and is not indexed by traditional search engines; it is stored in content-rich databases that are collectively called the Deep Web [3]. Our system aims to identify these Deep Web databases. Second, domain-specific web portals are growing in importance because they improve usability. Our goal is to build a system that can discover sources of biomedical information in the Deep Web while automatically maintaining and updating that information. We chose the biomedical domain because of its importance and the lack of existing portals that contain extensive information on it (especially information available only from the Deep Web). The system has three main components: a classifier that judges crawled pages to decide on link expansion, a focused crawler [1] with dynamically changing priorities governed by the classifier, and a query interface extractor [2] that uses information from the page to determine the presence of a Deep Web database.
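The interplay of the three components can be sketched as follows. This is a minimal illustration, not the paper's implementation: the keyword-based `relevance` function stands in for the trained TFIDF classifier, the toy in-memory `PAGES` graph stands in for the web, and the form check stands in for the query interface extractor. All names and thresholds here are illustrative assumptions.

```python
import heapq
import re

# Toy "web": url -> (page text, outgoing links). Purely illustrative.
PAGES = {
    "seed": ("gene protein clinical trial", ["db", "news"]),
    "db":   ("biomedical literature search <form action='/q'>", []),
    "news": ("sports scores today", []),
}

BIOMED_TERMS = {"gene", "protein", "clinical", "biomedical"}

def relevance(text):
    """Stand-in for the trained classifier: fraction of biomedical terms present."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & BIOMED_TERMS) / len(BIOMED_TERMS)

def has_query_interface(text):
    """Crude query-interface detector: the presence of an HTML form."""
    return "<form" in text.lower()

def focused_crawl(seed, threshold=0.25):
    """Priority-driven crawl: a page's score sets the priority of its links."""
    frontier = [(-1.0, seed)]          # max-heap via negated scores
    seen, deep_web_sources = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in seen:
            continue
        seen.add(url)
        text, links = PAGES[url]
        score = relevance(text)
        if score < threshold:
            continue                   # prune: irrelevant pages are not expanded
        if has_query_interface(text):
            deep_web_sources.append(url)   # candidate Deep Web database
        for link in links:
            heapq.heappush(frontier, (-score, link))
    return deep_web_sources
```

Running `focused_crawl("seed")` visits the relevant pages, prunes the off-topic one, and reports the page exposing a query form as a Deep Web candidate.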
2 Experiments and Results

The core of our experiments is to validate our reasons for using a focused crawler rather than a traditional crawler, and to measure how well the system identifies Deep Web databases that contain biomedical information. The experiments are divided into three parts: 1) determine the right parameters for the classifier, 2) measure the effectiveness of the focused crawler in finding biomedical data quickly and efficiently, and 3) locate query interfaces in relevant pages to identify biomedical Deep Web databases.

Measuring the Effectiveness of the Focused Crawler

The classifier was trained on a biomedical corpus [4] using TFIDF, producing an accuracy of 90.4% and an F-measure of 94.1%, which indicates that the system identifies most positives and negatives accurately but still yields a few false results. The most crucial evaluation of our focused crawler is to measure the rate at which relevant biomedical pages are acquired, and how effectively irrelevant pages are filtered out of the crawl.

[Figure: Harvest Ratio of Traditional vs Focused Crawler]
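The harvest-ratio comparison behind the figure can be expressed compactly: the running fraction of fetched pages judged relevant. The sketch below uses synthetic relevance labels for illustration; the actual numbers come from the paper's crawls, not from this code.

```python
def harvest_ratio(relevance_labels):
    """Running harvest ratio after each fetched page.

    relevance_labels: sequence of booleans, True if the i-th fetched
    page was judged relevant by the classifier.
    """
    ratios, relevant = [], 0
    for i, is_relevant in enumerate(relevance_labels, start=1):
        relevant += int(is_relevant)
        ratios.append(relevant / i)
    return ratios

# Synthetic example: a focused crawler tends to keep the ratio high,
# while a breadth-first crawl decays toward the topic's base rate.
focused     = harvest_ratio([True, True, True, False, True])
traditional = harvest_ratio([True, False, False, False, True])
```

A focused crawler that prunes irrelevant links should show a curve like `focused` staying near 1.0, whereas `traditional` drifts downward as off-topic pages dominate the fetch stream.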